Registration number: 19BCE1717
Faculty: Dr. C. Sweetlin Hemalatha
Slot: L39 + L40
Course code: CSE3505

Regression analysis

Dataset - processing

Import the dataset “CarPrice_Assignment.csv”

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 4.0.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.2
## Loading required package: lattice
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.3     ✓ purrr   0.3.4
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'readr' was built under R version 4.0.2
## Warning: package 'purrr' was built under R version 4.0.2
## Warning: package 'stringr' was built under R version 4.0.2
## Warning: package 'forcats' was built under R version 4.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
data <- read.csv("CarPrice_Assignment.csv")
head(data)
##   car_ID symboling                  CarName fueltype aspiration doornumber
## 1      1         3       alfa-romero giulia      gas        std        two
## 2      2         3      alfa-romero stelvio      gas        std        two
## 3      3         1 alfa-romero Quadrifoglio      gas        std        two
## 4      4         2              audi 100 ls      gas        std       four
## 5      5         2               audi 100ls      gas        std       four
## 6      6         2                 audi fox      gas        std        two
##       carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 1 convertible        rwd          front      88.6     168.8     64.1      48.8
## 2 convertible        rwd          front      88.6     168.8     64.1      48.8
## 3   hatchback        rwd          front      94.5     171.2     65.5      52.4
## 4       sedan        fwd          front      99.8     176.6     66.2      54.3
## 5       sedan        4wd          front      99.4     176.6     66.4      54.3
## 6       sedan        fwd          front      99.8     177.3     66.3      53.1
##   curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 1       2548       dohc           four        130       mpfi      3.47   2.68
## 2       2548       dohc           four        130       mpfi      3.47   2.68
## 3       2823       ohcv            six        152       mpfi      2.68   3.47
## 4       2337        ohc           four        109       mpfi      3.19   3.40
## 5       2824        ohc           five        136       mpfi      3.19   3.40
## 6       2507        ohc           five        136       mpfi      3.19   3.40
##   compressionratio horsepower peakrpm citympg highwaympg price
## 1              9.0        111    5000      21         27 13495
## 2              9.0        111    5000      21         27 16500
## 3              9.0        154    5000      19         26 16500
## 4             10.0        102    5500      24         30 13950
## 5              8.0        115    5500      18         22 17450
## 6              8.5        110    5500      19         25 15250
str(data)
## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : chr  "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr  "std" "std" "std" "std" ...
##  $ doornumber      : chr  "two" "two" "two" "four" ...
##  $ carbody         : chr  "convertible" "convertible" "hatchback" "sedan" ...
##  $ drivewheel      : chr  "rwd" "rwd" "rwd" "fwd" ...
##  $ enginelocation  : chr  "front" "front" "front" "front" ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : chr  "dohc" "dohc" "ohcv" "ohc" ...
##  $ cylindernumber  : chr  "four" "four" "six" "four" ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

Dropping non-numeric columns:

# Remove the ID column too as it is the unique identifier for a tuple and not an actual feature
data = subset(data, select = -c(car_ID,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,enginetype,cylindernumber,fuelsystem))

str(data)
## 'data.frame':    205 obs. of  15 variables:
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

The dataset now contains only numeric and integer columns. The chr columns are dropped here for simplicity; character data cannot enter a regression directly and would first have to be encoded numerically.
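As an aside, character columns are not inherently unusable: base R's model.matrix() converts factors into 0/1 dummy columns that can enter a regression. A minimal sketch on a toy frame (not the actual CarPrice data; the column names merely mirror the dataset's):

```r
# Sketch: one-hot encoding a character predictor with base R's model.matrix().
# `df` is a hypothetical toy stand-in for illustration only.
df <- data.frame(price = c(10, 20, 30),
                 fueltype = c("gas", "diesel", "gas"))
model.matrix(price ~ fueltype, data = df)
# Produces an intercept column plus a 0/1 "fueltypegas" indicator column
```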

Checking for na values:

print(paste("Number of missing values = ", sum(is.na(data))))
## [1] "Number of missing values =  0"

The dataset is clean! We can use it for visualisation and modelling
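The total above is zero; when it is not, a per-column breakdown shows where the gaps are. A one-line sketch, demonstrated on the built-in mtcars data since the CSV is not bundled here (the same call applies to `data`):

```r
# Count missing values separately for every column
colSums(is.na(mtcars))
# mtcars has no missing values, so every count is 0
```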

Plotting

Pair-plot:

pairs(data[,c("symboling","wheelbase","carlength","carwidth","carheight", "curbweight", "enginesize", "boreratio", "price")])

pairs(data[, c("stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg", "price")])

Since a single plot was too small, I split the features into two pair-plots to make them easier to analyse. At first glance, several features appear to have linear relationships. Next, we examine the correlation of each feature with the label, price.

Correlation:

cols = colnames(data)
for (c in cols) {
  print(paste("Column", c, ": ", cor(data[[c]],data$price)))
}
## [1] "Column symboling :  -0.0799782246427035"
## [1] "Column wheelbase :  0.57781559829215"
## [1] "Column carlength :  0.682920015677962"
## [1] "Column carwidth :  0.759325299741511"
## [1] "Column carheight :  0.119336226570494"
## [1] "Column curbweight :  0.835304879337296"
## [1] "Column enginesize :  0.874144802524512"
## [1] "Column boreratio :  0.553173236798444"
## [1] "Column stroke :  0.079443083881931"
## [1] "Column compressionratio :  0.0679835057994427"
## [1] "Column horsepower :  0.808138822536222"
## [1] "Column peakrpm :  -0.0852671502778569"
## [1] "Column citympg :  -0.68575133602704"
## [1] "Column highwaympg :  -0.697599091646557"
## [1] "Column price :  1"

The simple for loop above prints the correlation between price and every other attribute in the dataset. Some attributes have very low correlation, while others are strongly positively or negatively correlated. Before removing the low-correlation attributes, let us build a linear regression model on all features and analyse the R-squared values.
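The loop can also be collapsed into a single cor() call on the whole numeric frame, which returns the full correlation matrix at once. A sketch using the built-in mtcars data (since the CSV is not bundled here), with mpg standing in for price:

```r
# One cor() call replaces the per-column loop; the "mpg" column of the
# resulting matrix holds each feature's correlation with the target.
cors <- cor(mtcars)[, "mpg"]
sort(cors, decreasing = TRUE)
# The target's correlation with itself is 1, just as price shows above
```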

Linear regression Model (Initial - with all features)

Linear model:

model <- lm(price~.,data=data)
model
## 
## Call:
## lm(formula = price ~ ., data = data)
## 
## Coefficients:
##      (Intercept)         symboling         wheelbase         carlength  
##       -51650.650           285.883           167.699           -94.818  
##         carwidth         carheight        curbweight        enginesize  
##          466.618           194.752             1.878           116.782  
##        boreratio            stroke  compressionratio        horsepower  
##         -984.428         -3056.162           286.475            32.501  
##          peakrpm           citympg        highwaympg  
##            2.358          -286.940           191.304

Summary:

summary(model)
## 
## Call:
## lm(formula = price ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10220.7  -1636.8   -118.2   1500.0  14454.4 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -5.165e+04  1.566e+04  -3.299 0.001160 ** 
## symboling         2.859e+02  2.433e+02   1.175 0.241523    
## wheelbase         1.677e+02  1.075e+02   1.561 0.120256    
## carlength        -9.482e+01  5.550e+01  -1.708 0.089200 .  
## carwidth          4.666e+02  2.480e+02   1.882 0.061425 .  
## carheight         1.948e+02  1.382e+02   1.409 0.160480    
## curbweight        1.878e+00  1.736e+00   1.082 0.280718    
## enginesize        1.168e+02  1.383e+01   8.443 7.82e-15 ***
## boreratio        -9.844e+02  1.195e+03  -0.824 0.410979    
## stroke           -3.056e+03  7.780e+02  -3.928 0.000120 ***
## compressionratio  2.865e+02  8.342e+01   3.434 0.000730 ***
## horsepower        3.250e+01  1.626e+01   1.998 0.047105 *  
## peakrpm           2.358e+00  6.703e-01   3.518 0.000544 ***
## citympg          -2.869e+02  1.799e+02  -1.595 0.112288    
## highwaympg        1.913e+02  1.599e+02   1.196 0.233040    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3186 on 190 degrees of freedom
## Multiple R-squared:  0.8519, Adjusted R-squared:  0.841 
## F-statistic: 78.05 on 14 and 190 DF,  p-value: < 2.2e-16

From the summary, the significant variables (p &lt; 0.05) are enginesize, stroke, compressionratio, horsepower, and peakrpm.
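Rather than reading significance off the summary by hand, backward stepwise selection by AIC can prune the model automatically. A hedged sketch on the built-in mtcars data (the CSV is not bundled here), with mpg as the target:

```r
# Backward stepwise selection: start from the full model and drop
# predictors one at a time while the AIC keeps improving.
full    <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)   # step() retains only a few of the ten predictors
```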

Sigma and Confidence interval:

sigma(model)
## [1] 3186.001
confint(model)
##                          2.5 %        97.5 %
## (Intercept)      -8.253787e+04 -20763.425148
## symboling        -1.941014e+02    765.867256
## wheelbase        -4.424985e+01    379.647879
## carlength        -2.042966e+02     14.660896
## carwidth         -2.255920e+01    955.796172
## carheight        -7.789748e+01    467.401930
## curbweight       -1.546010e+00      5.301211
## enginesize        8.949968e+01    144.064280
## boreratio        -3.341025e+03   1372.169374
## stroke           -4.590881e+03  -1521.443381
## compressionratio  1.219180e+02    451.032518
## horsepower        4.199273e-01     64.582774
## peakrpm           1.035892e+00      3.680459
## citympg          -6.417103e+02     67.830849
## highwaympg       -1.241085e+02    506.715696

Residuals vs Fitted:

# Evaluating the residuals; price is log-transformed here to reduce its right skew
model1 <- lm(log(price)~.,data=data)
plot(model1,1)

The multiple R-squared is relatively high at approximately 0.85. Now, let us see whether removing low-correlation attributes and using a train-test split improves the performance of the model.

Two linear regression models follow: one based on the high-correlation attributes and the other on the statistically significant attributes.

1) Linear regression model (high correlation attributes only)

The attributes with high correlation to price are:

1) wheelbase: 0.578
2) carlength: 0.683
3) carwidth: 0.759
4) curbweight: 0.835
5) enginesize: 0.874
6) boreratio: 0.553
7) horsepower: 0.808
8) citympg: -0.686
9) highwaympg: -0.698

Attributes 1-7 have positive correlation, while 8 and 9 have negative correlation. As a rule of thumb, an absolute correlation of around 0.5 is weak and around 0.8 is strong, so attributes 1 and 6 can be considered weakly correlated while the rest are moderately to strongly correlated. Let us see whether a linear regression model built on these nine attributes performs better.
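The hand-picked list above can also be produced programmatically by thresholding the correlation vector. A sketch with mtcars and mpg standing in for `data` and price (the CSV is not bundled here):

```r
# Keep columns whose absolute correlation with the target exceeds 0.5;
# the target itself (correlation 1) survives the filter, mirroring how
# price is retained in the subset below.
cors <- cor(mtcars)[, "mpg"]
keep <- names(cors)[abs(cors) > 0.5]
keep
```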

data = subset(data, select = c(wheelbase,carlength,carwidth,curbweight,enginesize,boreratio,horsepower,citympg,highwaympg,price))
str(data)
## 'data.frame':    205 obs. of  10 variables:
##  $ wheelbase : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength : num  169 169 171 177 177 ...
##  $ carwidth  : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ curbweight: int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginesize: int  130 130 152 109 136 136 136 136 131 131 ...
##  $ boreratio : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ horsepower: int  111 111 154 102 115 110 110 110 140 160 ...
##  $ citympg   : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg: int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price     : num  13495 16500 16500 13950 17450 ...

Plotting and visualisation

library(ggplot2)
par(mar=c(2,2,2,2))
cols = colnames(data)

# Wheelbase
ggplot(data,aes(x=wheelbase,y=price))+geom_point()

ggplot(data,aes(x=wheelbase,y=price))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data,aes(x=wheelbase,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# carlength
ggplot(data,aes(x=carlength,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# carwidth
ggplot(data,aes(x=carwidth,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# curbweight
ggplot(data,aes(x=curbweight,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# enginesize
ggplot(data,aes(x=enginesize,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# boreratio
ggplot(data,aes(x=boreratio,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# horsepower
ggplot(data,aes(x=horsepower,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# citympg
ggplot(data,aes(x=citympg,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# highwaympg
ggplot(data,aes(x=highwaympg,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

The plots above visualise the trend of price against each feature using the ggplot2 library, consistent with the correlation values computed earlier. Now we can train the model.

Train-test ratio is 80:20
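caret::createDataPartition draws the 80% sample so that the outcome's distribution is preserved across the split. For comparison, a plain base-R equivalent (without the stratification) looks like this sketch:

```r
# Unstratified 80:20 split with base R's sample(), analogous in spirit
# to caret::createDataPartition.
set.seed(123)
n   <- 205                                  # rows in the car-price data
idx <- sample(seq_len(n), size = round(0.8 * n))
length(idx)        # 164 training rows
n - length(idx)    # 41 test rows
```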

set.seed(123)
train_samples <- data$price %>% 
  createDataPartition(p=0.8,list=FALSE)
head(train_samples)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         3
## [4,]         5
## [5,]         6
## [6,]         7
#c("wheelbase","carlength","carwidth","curbweight","enginesize","boreratio","horsepower","citympg","highwaympg","price")
train <- data[train_samples,]
test <- data[-train_samples,]
model2 <- lm(price~.,data=train)
summary(model2)
## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8241.9 -1641.7   -88.9  1188.1 14644.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -38522.555  15715.575  -2.451   0.0153 *  
## wheelbase      142.999    113.076   1.265   0.2079    
## carlength      -52.807     66.230  -0.797   0.4265    
## carwidth       437.487    297.385   1.471   0.1433    
## curbweight       2.816      1.718   1.639   0.1032    
## enginesize      93.554     14.684   6.371 2.03e-09 ***
## boreratio    -1790.727   1353.372  -1.323   0.1877    
## horsepower      42.427     17.471   2.428   0.0163 *  
## citympg       -158.173    210.504  -0.751   0.4536    
## highwaympg     142.820    193.937   0.736   0.4626    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3427 on 155 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.8195 
## F-statistic: 83.71 on 9 and 155 DF,  p-value: < 2.2e-16

With nine predictors and 155 residual degrees of freedom, the adjusted R-squared is the fairer measure of fit, since it penalises the extra predictors. The median residual being close to zero indicates the errors are roughly symmetric.
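The relationship between the two R-squared figures in the summary can be verified directly from the formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. Plugging in the summary's values:

```r
# Recompute adjusted R-squared from the summary's multiple R-squared.
# n = 165 training rows (155 residual df + 9 predictors + 1 intercept), p = 9.
r2 <- 0.8294; n <- 165; p <- 9
1 - (1 - r2) * (n - 1) / (n - p - 1)
# ~0.8195, matching the Adjusted R-squared reported above
```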

pred <- model2 %>%
  predict(test)
RMSE <- RMSE(pred,test$price)
RMSE
## [1] 3600.704
R2 <- R2(pred,test$price)
R2
## [1] 0.7864661

On the test set, the R-squared value is approximately 0.79.
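caret's RMSE() and R2() helpers hide very little: RMSE is the root of the mean squared prediction error, and this R² is the squared correlation between predictions and actuals. A self-contained sketch with toy vectors (not the model's real predictions):

```r
# What RMSE(pred, actual) and R2(pred, actual) compute, written out by hand.
actual <- c(10, 12, 14, 16)
pred   <- c(11, 11, 15, 17)     # toy predictions, each off by 1
sqrt(mean((pred - actual)^2))   # RMSE: 1
cor(pred, actual)^2             # R-squared: ~0.896
```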

sigma(model2)
## [1] 3427.2
sigma(model2)*100/mean(train$price)
## [1] 25.69429
confint(model2)
##                     2.5 %      97.5 %
## (Intercept) -6.956690e+04 -7478.21008
## wheelbase   -8.036922e+01   366.36661
## carlength   -1.836357e+02    78.02251
## carwidth    -1.499640e+02  1024.93827
## curbweight  -5.780033e-01     6.20993
## enginesize   6.454852e+01   122.55986
## boreratio   -4.464162e+03   882.70669
## horsepower   7.914860e+00    76.93820
## citympg     -5.740000e+02   257.65381
## highwaympg  -2.402817e+02   525.92087
par(mar=c(2,2,2,2))
hist(model2$residuals)

The residuals are centred near zero and their histogram is close to a normal distribution.

qqnorm(model2$residuals,ylab = "Residuals")
qqline(model2$residuals)

Most points lie close to the reference line, so the normality assumption can be considered reasonably satisfied.
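The visual Q-Q check can be backed by a formal normality test such as Shapiro-Wilk on the residuals; a p-value above 0.05 would mean normality is not rejected. A sketch on a small illustrative mtcars model, since the CSV is not bundled here:

```r
# Shapiro-Wilk test of residual normality on an illustrative model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
shapiro.test(residuals(fit))
```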

INFERENCE of LR model with high correlation attributes: The adjusted R-squared here (0.82) is slightly lower than the 0.85 obtained in the first model, which was fitted on the full dataset without a train-test split. The two values cannot be compared directly, since the training set here is different and smaller. A reduction of 0.03 is therefore not significant and does not necessarily indicate a worse model. In the following section we will examine a model built with only the significant attributes.

2) Linear regression model with high significance attributes:

The significant variables are enginesize, stroke, compressionratio, horsepower, and peakrpm.

data <- read.csv("CarPrice_Assignment.csv")
data = subset(data, select = -c(car_ID,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,enginetype,cylindernumber,fuelsystem))
data = subset(data, select = c(enginesize,stroke,compressionratio,horsepower,peakrpm,price))
str(data)
## 'data.frame':    205 obs. of  6 variables:
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

Plotting and visualisation

par(mar=c(2,2,2,2))
cols = colnames(data)

# enginesize
ggplot(data,aes(x=enginesize,y=price))+geom_point()

ggplot(data,aes(x=enginesize,y=price))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data,aes(x=enginesize,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# stroke
ggplot(data,aes(x=stroke,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# compressionratio
ggplot(data,aes(x=compressionratio,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# horsepower
ggplot(data,aes(x=horsepower,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# peakrpm
ggplot(data,aes(x=peakrpm,y=price))+geom_point()+geom_smooth(method='lm',se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

set.seed(123)
train_samples <- data$price %>% 
  createDataPartition(p=0.8,list=FALSE)
head(train_samples)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         3
## [4,]         5
## [5,]         6
## [6,]         7
train <- data[train_samples,]
test <- data[-train_samples,]
model3 <- lm(price~.,data=train)
summary(model3)
## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13832.4  -1516.3   -367.3   1623.6  13640.4 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.233e+04  4.921e+03  -2.506  0.01321 *  
## enginesize        1.407e+02  1.333e+01  10.549  < 2e-16 ***
## stroke           -2.580e+03  8.767e+02  -2.942  0.00374 ** 
## compressionratio  3.176e+02  7.401e+01   4.292 3.07e-05 ***
## horsepower        4.502e+01  1.382e+01   3.257  0.00138 ** 
## peakrpm           1.596e+00  7.284e-01   2.191  0.02992 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3397 on 159 degrees of freedom
## Multiple R-squared:  0.828,  Adjusted R-squared:  0.8226 
## F-statistic: 153.1 on 5 and 159 DF,  p-value: < 2.2e-16
par(mar=c(2,2,2,2))
hist(model3$residuals)

qqnorm(model3$residuals,ylab = "Residuals")
qqline(model3$residuals)

pred <- model3 %>%
  predict(test)
R2 <- R2(pred,test$price)
R2
## [1] 0.8035101
Conclusion:

The second model (LR with the significant attributes) shows a slightly higher adjusted R-squared than the LR model with high-correlation attributes, and its R² on the test set is also slightly higher. The residual distribution is again centred near zero, as in the previous section. The Normal Q-Q plot shows points lying closer to the line, indicating slightly lower residual variance, so the model with the significant attributes is the better of the two. As before, neither model can be compared directly with the initial model trained on the entire dataset, since the training-set sizes differ and such a comparison would not be valid.

Therefore, I conclude that the model with the five significant attributes performs well, with an adjusted R-squared of 0.82 on the training set and an R² of 0.80 on the test set.